(54) Speech recognition by neural network adapted to reference pattern learning.



(57) A speech recognition method according to the present invention uses, as the local distances (prediction residuals) between the feature vectors of input syllables/sound elements and the predicted vectors formed by the different statuses of reference neural prediction models (NPM's) using finite status transition networks, distances calculated through a variance weighting process using covariance matrixes. The category that minimizes the accumulated value of these local distances along the status transitions of all the prediction models is figured out by dynamic programming, and used as the recognition output. Learning of the reference prediction models used in this recognition method is accomplished by repeating said distance calculating process and the process of correcting the parameters of the different statuses and the covariance matrixes of said prediction models in the direction of reducing the distance between the learning patterns whose category is known and the prediction models of the same category as this known category; the models that have satisfied prescribed conditions of convergence through these calculating and correcting processes are determined as reference pattern models.
The present invention relates to a speech recognition method, and more particularly to a speech recognition method manifesting a high rate of recognition without requiring learning with a particularly large quantity of training data.

Speech signals are expressed as time series patterns of feature vectors, and speech recognition is based on the degree of identity between a reference pattern representing a known piece of speech and the pattern of the input speech signal. For these time series patterns, the Hidden Markov Model (HMM) is extensively used, as described in detail in the specifications of U.S. Patents Nos. 4,587,670 and 4,582,180. The HMM itself will not be explained in detail here because its detailed description can be found in S.E. Levinson, "Structural Methods in Automatic Speech Recognition", Proc. IEEE, 73, No. 11, 1985, pp. 1625-1650, besides said U.S. Patents.

The HMM assumes that the time series of feature vectors are generated by a Markov probability process. The standard patterns of the HMM are represented by a plurality of statuses and transitions between the statuses; each status outputs a feature vector according to a predetermined probability density distribution, while each transition between statuses is accompanied by a predetermined probability of transition. The likelihood, which represents the degree of matching between the input pattern and a reference pattern, is given by the probability that the Markov probability model generates the series of input pattern vectors. The probability of transition between statuses and the parameters defining the probability density distribution function, which characterize each reference pattern, can be determined with the Baum-Welch algorithm using a plurality of sets of vocalization data for training purposes.

However, the Baum-Welch algorithm, which is a statistical learning method, requires a large quantity of training data to determine the parameters of the model corresponding to reference patterns. The load of vocalization is therefore extremely great when a speech recognition apparatus is newly put to use, and this presents a serious obstacle to the practical use of such apparatuses. With a view to reducing this load, a number of speaker-adaptive methods have already been proposed to adapt a speech recognition apparatus to the speaker with a relatively small quantity of training data.

A speaker-adaptive method defines the similarity of acoustic events according to reference patterns corresponding to known speech signals and a new speaker's vocalization data for adaptation, basically using the physical distance between feature vectors as the scale, and carries out adaptation by estimating, on the basis of that similarity, the parameters of the model corresponding to acoustic events absent in the vocalization data for adaptation.

However, such a method of adaptation based on an estimation relying solely on physical distances, though providing a somewhat higher rate of recognition than before the adaptation, is far less effective in recognition than a method using reference patterns corresponding to a specific speaker, consisting of a large quantity of speech data. (For further details, see K. Shikano, K.F. Lee and R. Reddy, "Speaker Adaptation through Vector Quantization", Proc. ICASSP-86, Tokyo, 1986, pp. 2643-2646.)

Meanwhile, as means for improving the rate of recognition, the inventors of the present invention proposed a pattern recognition method based on the prediction of the aforementioned time series patterns. Using multilayer perceptrons (MLP's) based on a neural network as predictive means for the time series patterns, the outputs of the MLP's constitute reference patterns. The inventors named the reference patterns the "neural prediction model" (NPM). This NPM will not be described in detail here as its detailed explanations can be found in K. Iso and T. Watanabe, "Speaker-Independent Word Recognition Using a Neural Prediction Model," Proc. ICASSP-90, New Mexico, 1990, pp. 441-444 and the pending U.S. Patent Ser. No. (07-521625). In the NPM described in these references, a predictor (MLP) in the nth status of a reference pattern model consisting of a finite status transition network calculates a predicted vector for the feature vector of the input patterns at time t from a plurality of feature vectors at time t-1 and before. The distance between this predicted vector and the feature vector of the input pattern at time t is taken to be the local distance between said two feature vectors. In the NPM described in the above cited references, the squared distance or the like between the vectors is used as this local distance.

BRIEF SUMMARY OF THE INVENTION

Object of the Invention 

An object of the present invention is to reduce, in relative terms, the contributions of those components of said predicted vectors that are inferior in predictive accuracy (i.e. fluctuate more), thereby increasing the predictive accuracy of the predictor and improving the accuracy of recognition by the NPM.

Summary of the Invention

A pattern recognition method according to the invention recognizes the time series patterns of feature vectors representing input speech signals by using the NPM constituting said finite status transition network. Each status of this finite status transition network has a predictor for calculating a predicted vector from a plurality of feature vectors of the input time series patterns at time t-1 and before and a plurality of feature vectors at time t+1 and after. This predicted vector is compared with the feature vector of the input time series patterns at time t. As said local distance indicated by the result of this comparison, i.e. the local distance between the feature vector of the input time series patterns at time t (input feature vector) and the nth status of the finite status transition network (predicted feature vector), there is used the prediction residual calculated from the input feature vector, the predicted feature vector and a covariance matrix assigned in advance to said nth status. The total distance between said input time series patterns and the reference pattern model is given by the cumulative value of said local distance along said status transitions. This cumulative value is calculated for every category of the reference pattern model, and the category having the smallest cumulative value is selected as the recognition output.

The NPM according to the present invention composes said reference pattern model by learning. First, initial values are set for the parameters of said predictor and the covariance matrix accompanying each status of said finite status transition network. Next, said total distance between a learning pattern whose category is known and the reference pattern model of the same category is calculated, and the parameters of the predictor and the covariance matrix of each status are corrected by a predetermined algorithm in a direction that never fails to reduce said total distance. This correction is repeated, and the pattern model satisfying predetermined conditions of convergence is eventually selected as the reference pattern model.

Brief Description of the Drawings 


The above-mentioned and other objects, features and advantages of the present invention will become more apparent by reference to the following detailed description of the invention taken in conjunction with the accompanying drawings, wherein:

FIG. 1 illustrates the configuration of the multilayer perceptrons (MLP's) used as the predictor in the invention;
FIG. 2 illustrates the finite status transitions of an NPM, which constitutes the reference pattern model according to the invention;
FIG. 3 illustrates the configuration of the recognition algorithm according to the invention;
FIG. 4 is a recognition flow chart illustrating the pattern recognition method according to the invention;
FIG. 5 is a detailed flow chart of the initializing section of FIG. 4;
FIG. 6 is a detailed flow chart of the local distance calculation in FIG. 4;
FIG. 7 is a flow chart illustrating the reference pattern learning method according to the invention;
FIG. 8 is a block diagram of a speech recognition apparatus which is a preferred embodiment of the invention;
FIG. 9 is a detailed flow chart of predicted vector calculation at step 601 in FIG. 6;
FIG. 10 is a detailed flow chart of local distance calculation at step 602 in FIG. 6;
FIG. 11 is a detailed flow chart of initializing at step 701 in FIG. 7;
FIG. 12 is a detailed flow chart of optimal trajectory calculation at step 704 in FIG. 7;
FIG. 13 is a detailed flow chart of the calculation of the quantities of parameter correction at step 706 in FIG. 7;
FIG. 14 is a detailed flow chart of covariance matrix calculation at step 711 in FIG. 7; and
FIG. 15 is a detailed flow chart of convergence decision at step 712 in FIG. 7.

GENERAL DESCRIPTION 


To explain the basic principle of speech recognition according to the present invention with reference to FIG. 1, said predictor used in the invention consists of MLP's. As described in detail in K. Funahashi, "On the Approximate Realization of Continuous Mappings by Neural Networks", Neural Networks, Vol. 2, 1989, pp. 183-192, MLP's can approximate any (nonlinear) continuous function to any desired accuracy.

In the figure, the time series patterns inputted to the MLP's consist of feature vectors $a_{t-\tau_F}, \ldots, a_{t-1}$ for "forward prediction" and $a_{t+1}, \ldots, a_{t+\tau_B}$ for "backward prediction". The latter's prediction backward on the time axis is added to the former's forward prediction to improve the predictive accuracy for time series patterns which have close correlation backward on the time axis. As the plosive part of a plosive sound, for instance, is more closely correlated to the transitional part to the following vowel than to the closed section before the plosion, this backward prediction proves effective for plosive sounds.

The output pattern of the MLP's is the predicted vector $\hat{a}_t$ for the feature vector $a_t$ of input speech at time t. This predicted vector can be represented by the following equations, using the input-output relationship of MLP's:

$$h_t = f\Big(\sum_{\tau=1}^{\tau_F} W_\tau^F a_{t-\tau} + \sum_{\tau=1}^{\tau_B} W_\tau^B a_{t+\tau} + \theta_1\Big) \qquad (1)$$

$$\hat{a}_t = W_0 h_t + \theta_0 \qquad (2)$$

where $W_0$, $W_\tau^F$ and $W_\tau^B$ are the matrixes of coupling coefficients between the MLP units; $\theta_0$ and $\theta_1$, threshold vectors; and $f(\cdot)$, a vector obtained by applying the sigmoid function to each component of the argument vector.
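The predictor of equations (1) and (2) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `predict`, the list-of-lists matrix layout and the 0-based frame index are hypothetical choices, not taken from the specification, and the thresholds are added as equation (2) suggests.

```python
import math


def _matvec(W, v):
    """Multiply a matrix W (list of rows) by a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def predict(frames, t, W_F, W_B, W_0, theta_1, theta_0):
    """Predicted vector for frame t (0-based), after equations (1) and (2).

    frames   : list of C-dimensional feature vectors
    W_F[k-1] : H x C matrix applied to frame t-k (forward prediction)
    W_B[k-1] : H x C matrix applied to frame t+k (backward prediction)
    W_0      : C x H output matrix
    theta_1  : H-dimensional hidden-layer threshold vector
    theta_0  : C-dimensional output-layer threshold vector
    """
    if t - len(W_F) < 0 or t + len(W_B) >= len(frames):
        raise IndexError("not enough context frames around t")
    pre = list(theta_1)
    for k, W in enumerate(W_F, start=1):      # past frames a_{t-k}
        for h, val in enumerate(_matvec(W, frames[t - k])):
            pre[h] += val
    for k, W in enumerate(W_B, start=1):      # future frames a_{t+k}
        for h, val in enumerate(_matvec(W, frames[t + k])):
            pre[h] += val
    hidden = [1.0 / (1.0 + math.exp(-x)) for x in pre]   # sigmoid, eq. (1)
    return [o + b for o, b in zip(_matvec(W_0, hidden), theta_0)]  # eq. (2)
```

How the first and last frames of an utterance are padded is not covered by this sketch; it simply refuses to predict where context is missing.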

As stated above, composing a predictor of MLP's makes it possible to describe the causal relationship between proximate feature vectors in the time series of speech feature vectors as a nonlinear mapping formed by the MLP's. The relative accuracy of the prediction can be evaluated as the prediction residual between the predicted vector $\hat{a}_t$, which is the output of the MLP's, and the feature vector $a_t$ of the actual input speech.

An NPM which constitutes the reference pattern model of a basic unit of speech recognition, such as the word or the syllable, is represented by a transition network of a finite number of statuses (in this case four, namely statuses 201 through 204), and each status consists of said predictor composed of MLP's. An NPM representing a greater unit (such as the sentence) can be composed by connecting many NPM's for basic recognition units.

Next, the recognition algorithm using the NPM's basically derives from pattern matching between input speech and a reference pattern model. The reference pattern model for discrete recognition is an NPM for a basic recognition unit, while that for continuous recognition is an NPM obtained by connecting basic unit NPM's, and in both cases it is a finite status transition network accompanied by said MLP predictor. According to the present invention, continuous recognition is accomplished by taking note of the sound elements of speech signals, and therefore of the close correlation on the time axis of said feature vectors. For this reason, said finite status transition network is composed in a left-to-right pattern as shown in FIG. 2.

The distance (local distance) $d_t(n)$ between the feature vector of input speech at time t and the nth status of an NPM is given by the following equation:

$$d_t(n) = (a_t - \hat{a}_t(n))^T \Sigma_n^{-1} (a_t - \hat{a}_t(n)) + \ln|\Sigma_n| \qquad (3)$$

where $\hat{a}_t(n)$ is the predicted vector by the MLP predictor in the nth status and $\Sigma_n$, the covariance matrix in the nth status. The prediction residual is represented by $d_t(n)$, and $\Sigma_n$ in the equation is a quantity introduced to normalize the different extents of fluctuation of the prediction residual from component to component of the feature vector. Equation (3) can be interpreted as the logarithmic probability obtained when the probability at which the feature vector $a_t$ is observed in the nth status of the NPM is approximated by a Gaussian distribution represented by the following equation:

$$P(a_t \mid n) = (2\pi)^{-C/2}\, |\Sigma_n|^{-1/2} \exp\Big[-\tfrac{1}{2}(a_t - \hat{a}_t(n))^T \Sigma_n^{-1} (a_t - \hat{a}_t(n))\Big] \qquad (4)$$
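As a worked step that the text leaves implicit, the link between equations (3) and (4) can be written out, assuming equation (4) is the usual C-dimensional Gaussian density with mean $\hat{a}_t(n)$ and covariance $\Sigma_n$:

```latex
\begin{aligned}
-2 \ln P(a_t \mid n)
  &= (a_t - \hat{a}_t(n))^T \Sigma_n^{-1} (a_t - \hat{a}_t(n))
     + \ln|\Sigma_n| + C \ln 2\pi \\
  &= d_t(n) + C \ln 2\pi .
\end{aligned}
```

Under that assumption the local distance equals the negative log-likelihood up to a factor of 2 and an additive constant, so minimizing accumulated local distances corresponds to maximizing the Gaussian likelihood along the trajectory.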

If the nondiagonal terms of the covariance matrix $\Sigma_n$ are negligibly small, equation (3) can be approximated by the following equation:

$$d_t(n) = \sum_{c=1}^{C} \left[ \frac{(a_{tc} - \hat{a}_{tc}(n))^2}{\sigma_{nc}^2} + \ln \sigma_{nc}^2 \right] \qquad (5)$$

where subscript c represents a component of a C-dimensional feature vector, and $\sigma_{nc}^2$, the cth diagonal component of the covariance matrix $\Sigma_n$. Further, if $\sigma_{nc}^2 = 1$ (the covariance matrix is a unit matrix) here, equation (3) can be simplified into the following equation:

$$d_t(n) = \| a_t - \hat{a}_t(n) \|^2 \qquad (6)$$

This equation (6) is the scale of distance used in NPM's according to the prior art, in which differences in the extent of fluctuation of the prediction residual from component to component of the feature vector are not taken into account.
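The difference between the variance-weighted distance of equation (5) and the prior-art squared distance of equation (6) can be illustrated with a small Python sketch. The function names are hypothetical, and the diagonal-covariance form is assumed throughout.

```python
import math


def local_distance_diag(a, a_hat, var):
    """Equation (5): variance-weighted residual with a diagonal covariance.

    a, a_hat : observed and predicted feature vectors (length C)
    var      : per-component variances (diagonal of the covariance matrix)
    """
    return sum((x - p) ** 2 / v + math.log(v)
               for x, p, v in zip(a, a_hat, var))


def local_distance_sq(a, a_hat):
    """Equation (6): plain squared Euclidean distance (unit covariance)."""
    return sum((x - p) ** 2 for x, p in zip(a, a_hat))
```

With unit variances the two functions agree; a component with a large variance contributes less to the weighted distance, which is the effect the invention aims at.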

The distance (global distance) D between input speech and an NPM is given by the following equation as the cumulation of local distances:

$$D = \min_{\{n_t\}} \sum_{t=1}^{T} d_t(n_t) \qquad (7)$$

where $n_t$ is the identification number of the NPM status which performs prediction of the feature vector of input speech at time t. The minimization in equation (7) means the selection, out of the possible trajectories $n_1, n_2, \ldots, n_T$ (possible status transitions on the finite status transition network) between input speech and an NPM, of the one that minimizes the global distance (accumulated prediction residual) D. Where the skipless left-to-right pattern shown in FIG. 2 is to be used as the NPM, $n_t$ should satisfy the following constraints:

$$n_1 = 1 \qquad (8)$$

$$n_T = N \qquad (9)$$

$$n_t = n_{t-1} \ \text{or} \ n_{t-1} + 1 \quad (1 < t \le T) \qquad (10)$$

where T is the length of the feature vector time series patterns of input speech signals, and N, the number of NPM statuses (the identification number of the final status). Under these constraints, the problem of minimization can be solved by dynamic programming (DP) using the following recursion formula (for details on DP, reference may be made to H. Sakoe and S. Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition", IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-26 (1), February 1978, pp. 43-49):
$$g_t(n) = d_t(n) + \min\{\, g_{t-1}(n),\ g_{t-1}(n-1) \,\} \qquad (11)$$

where $g_t(n)$ is the partial sum of local distances $d_t(n)$, and the global distance D is given by the following equation:

$$D = g_T(N) \qquad (12)$$

By tracing back the results, the optimal trajectory $\{n_t^*\}$ that minimizes the accumulated prediction residual can be obtained. This information is used in the training algorithm to be described below. In recognizing continuous speech or the like, the word sequence of the recognition result can be identified from this information. FIG. 3 illustrates an outline of the recognition algorithm so far described.
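The DP recursion with back-tracking described above can be sketched as follows. This is a minimal illustration, assuming the local distances have already been computed into a table `local[t][n]` with 0-based frame and status indices; the function name `dp_align` and the table layout are hypothetical.

```python
def dp_align(local):
    """Skipless left-to-right DP over a table of local distances.

    local[t][n] : local distance between frame t and status n (0-based)
    Returns (D, path), where D is the accumulated distance and
    path[t] is the status assigned to frame t on the optimal trajectory.
    """
    T, N = len(local), len(local[0])
    if T < N:
        raise ValueError("fewer frames than statuses: no valid trajectory")
    INF = float("inf")
    g = [[INF] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    g[0][0] = local[0][0]                      # start in the first status
    for t in range(1, T):
        for n in range(N):
            stay = g[t - 1][n]                 # remain in status n
            move = g[t - 1][n - 1] if n > 0 else INF   # advance from n-1
            if move < stay:
                best, back[t][n] = move, n - 1
            else:
                best, back[t][n] = stay, n
            g[t][n] = local[t][n] + best       # recursion formula
    D = g[T - 1][N - 1]                        # end in the final status
    path = [N - 1]
    for t in range(T - 1, 0, -1):              # trace back
        path.append(back[t][path[-1]])
    path.reverse()
    return D, path
```

For recognition, one such alignment would be run per category and the category with the smallest returned distance chosen; for training, the returned path supplies the status assigned to each frame.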

Next will be described the training algorithm for automatically determining the parameters of NPM's (the weighting parameters of the neural network constituting the predictor and the parameters of the covariance matrix) by using known speech data. The purpose of training is to find the model parameters that minimize the aforementioned accumulated prediction residual for the speech data for training use. This can be formulated as a minimization problem whose evaluative function is the total of the accumulated prediction residuals for the whole training speech data:

$$D_{total} = \sum_{m=1}^{M} D(m) \qquad (13)$$

where M is the total number of the sets of training data, and D(m), the accumulated prediction residual for the m-th set of training data. D(m) can be calculated by the algorithm using DP, formulated in the preceding section. The evaluative function $D_{total}$ can be minimized in the optimal manner by the iterative algorithm stated below, combining DP and back-propagation (BP) (for more details on BP, reference may be made to R.P. Lippmann, "An Introduction to Computing with Neural Nets", IEEE ASSP Magazine, 3, 1987, pp. 4-22).

Step 1: Initialize all the NPM parameters (including the inter-unit coupling coefficient matrixes, threshold vectors and covariance matrixes of all the MLP predictors).
Step 2: m = 1
Step 3: Calculate the accumulated prediction residual D(m) for the m-th set of training data by DP. Seek the optimal trajectory $\{n_t^*\}$ by back-tracking.
Step 4: t = 1
Step 5: Assign the desirable output $a_t$ to the output $\hat{a}_t(n_t^*)$ of the $n_t^*$-th MLP predictor of the reference patterns, and calculate the correction quantity of each parameter by BP.
Step 6: t = t + 1
Step 7: If t is not greater than $T_m$ ($T_m$ is the number of frames of the m-th set of training data), return to step 5.
Step 8: m = m + 1
Step 9: If m is not greater than M, return to step 3.
Step 10: Update all the NPM parameters according to the correction quantities calculated at step 5.
Step 11: If the conditions of convergence are not satisfied, return to step 2.

While parameter corrections by BP in the foregoing algorithm use the deterministic steepest descent method, by which all the corrections are done collectively at step 10, the corrections can as well be accomplished consecutively by the stochastic (random) steepest descent method. Regarding the conditions of convergence at step 11, the convergence is deemed to have been achieved when, for instance, the decrement of the evaluative function $D_{total}$ drops below a certain level.
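The control flow of steps 1 to 11 can be sketched as a generic loop. This is a skeleton under loud assumptions: the names `train`, `align`, `gradient`, `step`, `tol` and `max_iter` are hypothetical, the parameters are reduced to a flat dictionary of floats, and the DP alignment and BP gradient are passed in as callbacks rather than implemented.

```python
def train(params, training_sets, align, gradient,
          step=0.1, tol=1e-6, max_iter=100):
    """Batch (deterministic) steepest descent around a DP alignment.

    params        : dict name -> float, stand-in for all NPM parameters
    align(p, x)   : returns (D_m, trajectory) for one training set (step 3)
    gradient(p, x, trajectory) : returns dict name -> dD_m/dparam (steps 4-7)
    Returns (params, total), total being the evaluative function measured
    before the last update.
    """
    previous = None
    total = None
    for _ in range(max_iter):
        total = 0.0
        correction = {k: 0.0 for k in params}
        for data in training_sets:                  # steps 2, 8 and 9
            d_m, trajectory = align(params, data)   # step 3
            total += d_m
            grads = gradient(params, data, trajectory)
            for k, g in grads.items():              # accumulate corrections
                correction[k] -= step * g
        for k in params:                            # step 10: collective update
            params[k] += correction[k]
        # Step 11: stop once the decrement of the total falls below tol
        # (this also stops if the total has increased).
        if previous is not None and previous - total < tol:
            break
        previous = total
    return params, total
```

The stochastic variant mentioned in the text would apply each correction immediately inside the inner loop instead of accumulating it for step 10.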

When a covariance matrix is introduced into the scale of local distances, the quantity of back-propagation error by BP requires the following correction (here the nondiagonal terms of the covariance matrix are taken to be small enough to ignore). The amount of the back-propagation error $\delta_{tc}(n_t^*)$ of the c-th unit of the output layer of the MLP predictor in the $n_t^*$-th status is:

$$\delta_{tc}(n_t^*) = \frac{a_{tc} - \hat{a}_{tc}(n_t^*)}{\sigma_{n_t^* c}^2} \qquad (14)$$

This differs from the amount of back-propagation error without a covariance matrix by the reciprocal of the variance. The estimated value of the covariance matrix is so determined as to minimize the evaluative function. Thus, from the following optimizing condition:

$$\frac{\partial D_{total}}{\partial \sigma_{nc}^2} = 0 \qquad (15)$$

is derived the following estimation formula (re-estimation formula for use at step 10) of the covariance matrix:

$$\sigma_{nc}^2 = \frac{\sum_{m=1}^{M} \sum_{t=1}^{T_m} \delta_{n n_t^*}\, (a_{tc} - \hat{a}_{tc}(n))^2}{\sum_{m=1}^{M} \sum_{t=1}^{T_m} \delta_{n n_t^*}} \qquad (16)$$

where $T_m$ is the number of frames of the m-th set of training data, and $\delta_{n n_t^*}$, a Kronecker delta:

$$\delta_{mn} = \begin{cases} 1 & \text{if } m = n \\ 0 & \text{otherwise} \end{cases} \qquad (17)$$
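Read as a recipe, the covariance re-estimation amounts to averaging the squared residuals of the frames that the optimal trajectory assigns to each status. A minimal Python sketch under the diagonal-covariance assumption follows; the function name, the data layout, the variance `floor` and the fallback value of 1.0 for a status with no frames are all hypothetical additions, not part of the specification.

```python
def reestimate_variances(residuals, trajectories, num_statuses, floor=1e-6):
    """Per-status, per-component variance from aligned residuals.

    residuals[m][t]    : list of C residual components for frame t of set m
                         (observed minus predicted, for the aligned status)
    trajectories[m][t] : 0-based status index assigned to that frame
    Returns var[n][c].
    """
    C = len(residuals[0][0])
    sums = [[0.0] * C for _ in range(num_statuses)]
    counts = [0] * num_statuses
    for res_m, traj_m in zip(residuals, trajectories):
        for r, n in zip(res_m, traj_m):
            counts[n] += 1                  # Kronecker delta selects status n
            for c in range(C):
                sums[n][c] += r[c] ** 2
    return [
        [max(s / counts[n], floor) if counts[n] else 1.0 for s in sums[n]]
        for n in range(num_statuses)
    ]
```

The floor guards against a zero variance, which would make the weighted distance undefined; whether and how the original method handles that case is not stated.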



The convergence of the foregoing iterative training algorithm can be proven in the following way. The value $D_{DP}^{(k)}$ of the evaluative function in the k-th iteration before the parameter correction (immediately before step 10) is the sum of prediction residuals accumulated along the optimal (accumulated prediction residual minimizing) trajectory $\{n_t^*\}$ determined by DP for each set of training data. The sum of prediction residuals accumulated along the same trajectory after the parameter correction at step 10 is represented by $D_{BP}^{(k)}$. The parameter correction by BP is so accomplished as to reduce the squared error in the output layer of each MLP predictor; in the case of the NPM, where this squared error is identical with the prediction residual, the accumulated prediction residual is reduced without fail by the parameter correction. (Re-estimation of the covariance matrix is considered together with BP.)

$$D_{BP}^{(k)} \le D_{DP}^{(k)} \qquad (18)$$

However, when the parameters are corrected by BP, the optimality of the optimal trajectory achieved at step 3 is lost. Therefore, in the (k+1)-th iteration, the optimal trajectory is sought by DP for the corrected model parameters. Since DP gives the optimal trajectory that minimizes the accumulated prediction residual:

$$D_{DP}^{(k+1)} \le D_{BP}^{(k)} \qquad (19)$$

formulas (18) and (19) together indicate that the evaluative function decreases monotonically with iteration:

$$D_{DP}^{(k+1)} \le D_{DP}^{(k)} \qquad (20)$$

Qualitatively, the reason why this iterative algorithm converges can be understood to be that DP and BP are minimization methods for the same evaluative function (the accumulated sum of prediction residuals) and are applied consecutively.

DETAILED DESCRIPTION 



The present invention will be described in further detail below with reference to FIGS. 4 to 6 which are the 
flow charts of recognition by the speech recognition method according to the invention, FIG. 7 which is a flow 
chart of reference pattern learning by the speech recognition method according to the invention and FIG. 8 
which is a block diagram of a speech recognition apparatus according to the invention. 

A speech input unit 8101 in FIG. 8, consisting of a microphone, an amplifier and an analog-to-digital (A/D) converter among other things, digitizes speech signals representing speech sounds uttered by the user and supplies them to the following acoustic analyzer 8102. The acoustic analyzer 8102 subjects these digitized speech signals to spectral analysis by FFT or the like, and converts them into a time series pattern of feature vectors. Besides FFT, the spectral analysis can be accomplished by linear predictive coding (LPC) or the cepstrum method.

EP 0 510 632 A2



A reference pattern storage section 8103 stores the parameters of the reference pattern models of all the word categories which are the objects of recognition. If, for instance, 10 numerals are to be recognized, the parameters of the reference pattern model of each of the numerals from 0 through 9 are stored. The reference pattern model of each category here is a finite status transition network each of whose statuses is accompanied by an MLP predictor.

What are stored in the storage section 8103 are the parameters of the MLP predictors of the different statuses and the covariance matrixes of the respective statuses. Where MLP predictors each having one hidden layer, as shown in FIG. 1, are used, the parameters are $\tau_F$ inter-unit coupling coefficient matrixes $W_1^F, \ldots, W_{\tau_F}^F$ (each matrix consists of H rows by C columns, where H is the number of hidden layer units and C is the number of dimensions of the feature vector) for forward prediction, $\tau_B$ inter-unit coupling coefficient matrixes $W_1^B, \ldots, W_{\tau_B}^B$ (each matrix consists of H rows by C columns) for backward prediction, an inter-unit coupling coefficient matrix $W_0$ (consisting of C rows by H columns), the threshold vector $\theta_1$ of the hidden layer (H-dimensional vector), and the threshold vector $\theta_0$ of the output layer (C-dimensional vector). Each covariance matrix is a symmetric one of C rows by C columns, and the number of independent components is C(C + 1)/2.
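To make the storage requirement concrete, the parameter count per status can be tallied from the dimensions listed above. The helper below is a hypothetical illustration; the values C = 10 and H = 9 in the usage note are arbitrary examples, not figures from the specification.

```python
def parameters_per_status(C, H, tau_f, tau_b):
    """Count stored parameters for one status.

    C     : dimension of the feature vector
    H     : number of hidden-layer units
    tau_f : number of past frames used (forward prediction)
    tau_b : number of future frames used (backward prediction)
    Returns (mlp_parameters, independent_covariance_components).
    """
    mlp = (tau_f * H * C      # forward matrixes, each H x C
           + tau_b * H * C    # backward matrixes, each H x C
           + C * H            # output matrix, C x H
           + H                # hidden-layer threshold vector
           + C)               # output-layer threshold vector
    cov = C * (C + 1) // 2    # symmetric C x C covariance matrix
    return mlp, cov
```

For example, with two past frames and one future frame, `parameters_per_status(10, 9, 2, 1)` gives 379 MLP parameters and 55 covariance components per status.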

A distance calculator 8104 calculates the distance between the feature vector time series pattern given from the acoustic analyzer 8102 and the reference pattern model of each of the aforementioned categories, and supplies the result of calculation to a recognition result output section 8105. The distance between the feature vector time series pattern of input speech and the reference pattern model is the global distance D defined by the above cited equation (7). The method of calculation is given by the flow from steps 401 to 414 in FIG. 4. In FIG. 4, t is a variable representing the time axis of the feature vector time series pattern of input speech, and takes one of the integral values from 1 through T; s is a variable representing the category of the objects of recognition, and takes one of the integral values from 1 through S (where the objects of recognition are 10 numerals, S = 10); n is a variable representing the status of the reference pattern model of each category, and takes one of the integral values from 1 through $N^{(s)}$ ($N^{(s)}$ is the number of statuses of the reference pattern model of category s); $d_t^{(s)}(n)$ is a variable for storing the local distance between the n-th status of category s and the feature vector $a_t$ of input speech at time t; and $g_t^{(s)}(n)$ is a variable for storing the accumulated prediction residual of the n-th status of category s at time t.

At step 401, the variables are initialized in order to calculate the global distance D of equation (7) by DP. Referring to FIG. 5 illustrating this step 401 in detail, a counter is initialized at steps 501 to 503. At step 504, the storage areas for the local distance $d_t^{(s)}(n)$ and the accumulated prediction residual $g_t^{(s)}(n)$ are initialized. At steps 505 to 510, the increment and conditions of the counter are judged, and the initialization at step 504 is applied with respect to all the values of s, t and n. Then, at steps 511 to 514, the value at the start point of the accumulated prediction residual of each category s is set.

Next, referring to FIG. 6 illustrating in detail step 405 for calculating the local distance, at step 601 an MLP predictor accompanying the n-th status of category s calculates the predicted vector $\hat{a}_t$ to be compared with the input speech feature vector $a_t$ at time t. In FIG. 9, which shows in further detail this calculation represented by the above cited equations (1) and (2), X is a scalar variable; Y, an H-dimensional array (Y(h) is the h-th element); Z, a C-dimensional array (Z(c) is the c-th element); H, the number of hidden layer units; and C, the number of dimensions of the feature vector. Further, $(\theta_1)_h$ at step 9202 is the h-th component of the threshold vector $\theta_1$; $(W_\tau^F)_{hc}$ at step 9205 is the element on the h-th row and the c-th column of the coupling coefficient matrix $W_\tau^F$; and $\tau_F$ and $\tau_B$ represent the numbers of feature vectors of input speech used for the aforementioned forward prediction and backward prediction, respectively, $\tau_F = 2$ and $\tau_B = 1$ being used, to be specific. The calculation shown in FIG. 9 gives the predicted vector $\hat{a}_t$ as a vector array Z having C components.

Next, referring further to FIG. 6, at step 602 the distance $d_t^{(s)}(n)$ is calculated from the input speech feature vector $a_t$ at time t and the predicted vector $\hat{a}_t$ by the MLP predictor accompanying the n-th status of category s, calculated at step 601. At step 10301 of FIG. 10, which shows in further detail this calculation represented by the above cited equation (3), $|\Sigma_n^{(s)}|$ is the determinant of the covariance matrix in the n-th status of category s. The method to calculate the determinant is not explained here as it is evident from elementary linear algebra. At step 10301, the natural logarithm of the determinant of the covariance matrix is substituted for a variable X. Variables Y and Z in FIG. 10 are both C-dimensional arrays. At step 10305, $(a_t)_c$ is the c-th component of the input speech feature vector $a_t$, and $(\hat{a}_t(n))_c$ is the c-th component of the predicted vector by the MLP predictor accompanying the n-th status of category s, calculated at step 601. At step 10306, $\big((\Sigma_n^{(s)})^{-1}\big)_{c_1 c_2}$ is the component on the $c_1$-th row and the $c_2$-th column of the inverse matrix of the covariance matrix $\Sigma_n^{(s)}$. The method to calculate the inverse matrix is not explained here as it is evident from elementary linear algebra. Processing illustrated in FIG. 10 stores the value of the local distance $d_t^{(s)}(n)$ into the variable X.
By the processing up to step 414 in FIG. 4, the global distance D between the feature vector time series pattern of input speech and the reference pattern model is calculated. At this time, the global distance D between the above mentioned patterns of category s is obtained as the accumulated prediction residual $g_T^{(s)}(N^{(s)})$ of the final status $N^{(s)}$ of each reference pattern model at time T (the terminal point of one time series pattern).

The recognition result output section 8105 selects the shortest of the distances between the feature vector time series pattern of input speech and the reference pattern model of each category given from the distance calculator 8104, and supplies its category name as the recognition result. Its specific processing is step 415 in FIG. 4.

A training speech database storage section 8106 stores the speech data of all the word/syllable categories 
which are the objects of recognition, i.e. the time series patterns of feature vectors corresponding to each cat- 
egory. 

A reference pattern corrector 8107 calculates the quantities of correction required for the parameters of 
the reference pattern model of each category read in from the reference pattern storage section 8103 on the 
basis of training speech data from the training speech database storage section 8106, and corrects the afore- 
mentioned parameters corresponding to the reference patterns stored in the reference pattern storage section 
8103. 

Referring to FIG. 7, which illustrates that signal processing, the parameters of the reference pattern
models of all the categories (including the inter-unit coupling coefficients of the MLP predictors in the
different statuses, the threshold vectors and the covariance matrixes) are initialized with random numbers
at step 701. This processing is illustrated in detail in FIG. 11. Here, s is a variable representing the
category of the object of recognition and takes, where 10 numerals are to be recognized, one of the
integral values from 1 through 10; n is a variable representing the n-th status of the reference pattern
model of the s-th category and takes one of the integral values from 1 through N^(s). At step 11406,
(W_t)_hc is the element on the h-th row and the c-th column of the t-th inter-unit coupling matrix for
forward prediction of the MLP predictor accompanying the n-th status of the reference pattern model of the
s-th category. Here, "random" denotes random numbers, specifically uniform random numbers ranging from
-0.3 to 0.3. Similarly, at the following steps 11416, 11424, 11429 and 11432, the object parameter is the
parameter of the n-th status of the reference pattern model of the s-th category. At step 11439, the
variables D1 and D2, to be used subsequently for the convergence decision at step 712, and another
variable P are initialized.
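
The random initialization of the coupling coefficients can be sketched as below. This is an illustration under stated assumptions, not the processing of FIG. 11 itself: only one H x C coupling matrix per status is shown, the threshold vectors and covariance matrixes are omitted, and the names init_coupling_matrix and init_models, as well as the seed argument, are hypothetical.

```python
import random


def init_coupling_matrix(rows, cols, rng):
    """rows x cols matrix of uniform random numbers from -0.3 to 0.3."""
    return [[rng.uniform(-0.3, 0.3) for _ in range(cols)] for _ in range(rows)]


def init_models(num_statuses, H, C, seed=0):
    """Build one coupling matrix per status n of each category s.

    num_statuses maps category s to its number of statuses N(s);
    H and C are the hidden-layer and feature-vector dimensions.
    """
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return {
        s: [init_coupling_matrix(H, C, rng) for _ in range(n_s)]
        for s, n_s in num_statuses.items()
    }


# Two hypothetical categories with 3 and 4 statuses respectively
models = init_models({1: 3, 2: 4}, H=5, C=2)
```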

At step 704, the optimal trajectories between the m-th set of training data of the s-th category (a
feature vector time series of length T) and the reference pattern models of the s-th category are figured
out. Details of this processing are shown in FIG. 12. The initializing section 2 of step 12501 results
from fixing the variable s, which represents the category to be processed, in the initialization given in
FIG. 4 (step 401). More specifically, this processing is achieved by eliminating steps 502, 507, 508, 511,
513 and 514 of FIG. 5. The distance calculating section 2 of step 12502 results from fixing the variable s
in the distance calculating process from steps 402 through 414 given in FIG. 4. More specifically, this
processing is accomplished by removing steps 403, 411 and 412. For the subsequent convergence decision,
the accumulated prediction residual g_T(N^(s)) calculated by the processing up to step 414 is added in
advance to the variable D1 (D1 = D1 + g_T(N^(s))). The optimal trajectories n_1, ..., n_T are obtained
by the processing from steps 12503 through 12510.
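
The calculation of an optimal trajectory by dynamic programming can be sketched as follows. The sketch assumes a left-to-right network in which each status may be reached only from itself or from the immediately preceding status, and a trajectory that starts in the first status and ends in the final status; these transition rules, the function name optimal_trajectory and the table local_dist are assumptions made for illustration, not a reproduction of FIG. 12. The time series is assumed to be at least as long as the number of statuses.

```python
import math


def optimal_trajectory(local_dist):
    """local_dist[t][n] is the local distance of frame t to status n.

    Returns (accumulated residual at the final status, status per frame).
    """
    T, N = len(local_dist), len(local_dist[0])
    g = [[math.inf] * N for _ in range(T)]   # accumulated residuals
    back = [[0] * N for _ in range(T)]       # predecessor status per cell
    g[0][0] = local_dist[0][0]               # trajectory starts in status 0
    for t in range(1, T):
        for n in range(N):
            stay = g[t - 1][n]
            advance = g[t - 1][n - 1] if n > 0 else math.inf
            if advance < stay:
                best, back[t][n] = advance, n - 1
            else:
                best, back[t][n] = stay, n
            g[t][n] = local_dist[t][n] + best
    # Backtrack from the final status at the last frame
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return g[T - 1][N - 1], path
```

The first returned value corresponds to the quantity accumulated into D1, and the second to the status sequence that gives each frame its correspondence to a predictor.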

Referring again to FIG. 7, at step 706, the parameters of the MLP predictor accompanying the n_t-th
status, which was given correspondence at step 704 to the feature vector a_t of the m-th set of training
data of the s-th category at time t, are corrected by back propagation. Details of this processing are
shown in FIG. 13. At step 13602, the predicted vector for the feature vector at time t is calculated. This
process is shown in FIG. 9 (referred to above). In the processing at the following steps 13603 through
13634, Y is an H-dimensional array representing the output of the hidden layer units calculated at step
13602; ΔZ, a C-dimensional array representing the error regarding the output layer units; ΔY, an
H-dimensional array representing the error regarding the hidden layer units; and ε, a learning coefficient
given in advance (specifically taking the value of 0.1 or the like). Here, the nondiagonal terms of the
covariance matrix are negligibly small, and the matrix is accordingly treated as a diagonal one, whose
c-th diagonal component is (Σ_n)_cc at step 13607. By the processing shown in FIG. 13, the parameters of
the n_t-th MLP predictor of the s-th category are so corrected as to reduce the prediction residuals. By
the processing from steps 703 through 710, the above-described corrective training is applied to all the
sets of training data of the s-th category.
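
One corrective step of this kind can be sketched as follows. The sketch assumes a predictor with a single sigmoid hidden layer and a linear output layer, omits the threshold vectors, and weights each output error by the reciprocal of the corresponding diagonal covariance component; the network shape, the function name backprop_step and the argument names are assumptions for illustration and are not taken from FIG. 13.

```python
import math


def backprop_step(x, target, W1, W2, var, eps=0.1):
    """One corrective update of a one-hidden-layer predictor (in place).

    x: input (context) vector; target: feature vector a_t;
    W1: H x len(x) hidden weights; W2: C x H output weights;
    var: C diagonal covariance components; eps: learning coefficient.
    Returns the variance-weighted squared residual before the update.
    """
    # Forward pass: sigmoid hidden outputs Y, linear outputs Z
    Y = [1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(row, x))))
         for row in W1]
    Z = [sum(w * y for w, y in zip(row, Y)) for row in W2]
    # Output error dZ, each component divided by its variance
    dZ = [(a - z) / v for a, z, v in zip(target, Z, var)]
    # Hidden error dY, propagated back through W2 and the sigmoid slope
    dY = [y * (1.0 - y) * sum(dZ[c] * W2[c][h] for c in range(len(W2)))
          for h, y in enumerate(Y)]
    # Gradient-descent corrections of both weight matrices
    for c, row in enumerate(W2):
        for h in range(len(row)):
            row[h] += eps * dZ[c] * Y[h]
    for h, row in enumerate(W1):
        for i in range(len(row)):
            row[i] += eps * dY[h] * x[i]
    return sum((a - z) ** 2 / v for a, z, v in zip(target, Z, var))
```

Repeated calls on the same frame should lower the returned residual, which is the sense in which the parameters are corrected so as to reduce the prediction residuals.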

At step 711, a new covariance matrix is calculated on the basis of equation (16) cited above. The process
at step 711 is illustrated in FIG. 14, wherein the variables X and Y are each arrays of N^(s) rows and C
columns. The optimal trajectory calculation at step 14709 is processed in the same manner as step 704,
as shown in detail in FIG. 12. Processing by the predicted vector calculating section at step 14712 is the
same as step 13602, whose details are shown in FIG. 9. The symbol (â_t(n))_c at step 14714 denotes the
c-th component of the predicted vector calculated at step 14712, and (Σ_n)_cc at step 14724 denotes the
c-th diagonal component of the covariance matrix of the n-th status.
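
A re-estimation of the diagonal covariance components can be sketched as below. Equation (16) is not reproduced in this passage, so the sketch assumes that each diagonal component is the mean squared prediction residual over the frames given correspondence to the status concerned; that assumption, the function name reestimate_variances and its arguments are illustrative only.

```python
def reestimate_variances(frames, predictions, path, num_statuses):
    """Diagonal variance per status from aligned squared residuals.

    frames[t] and predictions[t] are C-dimensional lists; path[t] is the
    status index given correspondence to frame t by the optimal trajectory.
    """
    C = len(frames[0])
    sq_sum = [[0.0] * C for _ in range(num_statuses)]
    count = [0] * num_statuses
    for a, a_hat, n in zip(frames, predictions, path):
        count[n] += 1
        for c in range(C):
            sq_sum[n][c] += (a[c] - a_hat[c]) ** 2
    # A status with no aligned frame is reported as None
    return [
        [s / count[n] for s in sq_sum[n]] if count[n] else None
        for n in range(num_statuses)
    ]
```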

The convergence decision at step 712 recognizes convergence if the calculated varying rate of the variable
D1 (the accumulated prediction residual for all the sets of training data) is found to be smaller than a
threshold given in advance. The processing is shown in FIG. 15. At step 15802, the absolute value of the
varying rate of the accumulated prediction residual for all the sets of training data is compared with a
threshold T_h given in advance (actually 0.001 or the like). By the processing at steps 701 through 712,
iterative training is carried out for all the sets of training data to give the optimal model parameters.
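
The convergence decision can be sketched as follows. The sketch assumes that the varying rate is the change in the accumulated prediction residual between two successive training iterations divided by the earlier value; the exact expression of FIG. 15 is not given in this passage, and the function name has_converged is hypothetical.

```python
def has_converged(d_current, d_previous, threshold=0.001):
    """True when the relative change of the accumulated residual is small."""
    if d_previous == 0:
        # Avoid division by zero; only an unchanged zero residual converges
        return d_current == 0
    rate = (d_previous - d_current) / d_previous
    return abs(rate) < threshold


print(has_converged(99.95, 100.0))  # change of 0.05 % between iterations
print(has_converged(90.0, 100.0))   # change of 10 % between iterations
```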

As hitherto described, the speech recognition method according to the present invention is characterized
in that finite status transition networks of the left-to-right pattern, accompanied by MLP predictors
using neural networks, compose NPM's, which are reference pattern models involving both forward and
backward prediction; the local distances between these NPM's and the time series patterns of input speech
feature vectors are calculated by DP matching; and covariance matrixes are introduced into this
calculation of local distances. As a result, the speech recognition method according to the invention is
adaptable to the speech of any unspecified speaker and makes possible speech recognition, in particular
continuous speech recognition, at a high rate of recognition.

Although the invention has been described with reference to a specific embodiment (or specific embodi- 
ments), this description is not meant to be construed in a limiting sense. Various modifications of the disclosed 
embodiment, as well as other embodiments of the invention, will become apparent to persons skilled in the art 
upon reference to the description of the invention. It is therefore contemplated that the appended claims will 
cover any such modifications or embodiments as fall within the true scope of the invention. 

Claims 

1. A pattern recognition method for recognizing syllables and sound elements on the basis of comparison of
input time series patterns expressed as the feature vectors of the syllables and sound elements with
reference pattern models using a finite status transition network, wherein each status of said finite
status transition network has a predictor for calculating a predicted vector at time t from a plurality of
feature vectors of said input time series patterns at time t-1 and before and a plurality of feature
vectors at time t+1 and after; and a prediction residual determined by said input feature vector, said
predicted feature vector by the predictor of said n-th status at time t, corresponding to said feature
vector, and a covariance matrix accompanying said n-th status is used as the local distance between the
feature vector of said input time series patterns at time t, i.e. the input feature vector, and the n-th
status of said finite status transition network.

2. A speech recognition method, as claimed in Claim 1, wherein correspondences between said input time
series patterns and said predictors are so figured out by dynamic programming as to minimize the
accumulated value of said local distances along said status transitions, and said accumulated value is
used as the distance between said input time series patterns and said reference pattern models.

3. A speech recognition method, as claimed in Claim 2, wherein said accumulated value is calculated for
every category of words, and the category having the smallest of said accumulated values which have
been calculated is used as the recognition output.

4. A speech recognition method, as claimed in Claim 2 or 3, wherein initial values are set for the
parameters of said predictor and said covariance matrix accompanying each status of said finite status
transition network, said distance between said input time series pattern for the learning purpose, said
category of which is known, and said reference pattern model corresponding to the same category as said
known category is calculated; the parameters of said predictor and said covariance matrix of each state
are iteratively corrected in the direction of reducing said distance; and said reference pattern model
said distance for which satisfies predetermined conditions of convergence is thereby obtained.